This last discussion (we'll do an in-class on Thursday) covers two final essential building blocks of data analysis: (a) variable selection and, as an optional yet useful topic for prediction and machine learning, (b) random forests.
Your objectives for this week (no video this time!):
"Choose well. Your choice is brief, and yet endless."
--- Johann Wolfgang von Goethe
After reading this chapter you will be able to:
During Monday's discussion we covered the need to select a subset of predictors from among many, in particular to avoid overfitting and to improve the interpretability of our model --- aspects which are often essential in biology. In this chapter, we will discuss several criteria and procedures for choosing a "good" model from among many candidates.
To give a little bit of motivation, here is the setting that we are considering. Suppose you have a dataset with \(n\) observations of \(p+1\) variables, and you would like to build a model to predict one of them (let's call it \(y\)) based on the remaining \(p\). But \(p\) is very large: suppose you are looking at genes, and \(y\) is a binary variable (e.g., cancer / not cancer). Ideally, you would like to find a small subset of \(p_1 \ll p\) variables that you can associate with cancer. But how do you find that subset?
The statistical recipe for this setting has two components, both discussed below: a quality criterion for comparing candidate models, and a search procedure for deciding which candidates to examine.
So far, we have seen criteria such as \(R^2\) and \(\text{RMSE}\) for assessing quality of fit (see the labs on linear regression). However, both of these have a fatal flaw. Increasing the size of a model, that is, adding predictors, can at worst leave them unchanged. It is impossible to add a predictor to a model and make \(R^2\) or \(\text{RMSE}\) worse. That means, if we used either of these to choose between models, we would always simply choose the larger model, and eventually we would simply be fitting to noise.
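This is easy to demonstrate. Here is a minimal sketch using the built-in mtcars data, where we add a pure-noise predictor to a model and watch \(R^2\) refuse to decrease (the variable names are illustrative):

```r
# Sketch using the built-in mtcars data: adding even a pure-noise predictor
# can never decrease R^2, since it can never increase the RSS.
set.seed(42)
dat = mtcars
dat$noise = rnorm(nrow(dat))      # a predictor unrelated to mpg

fit_small = lm(mpg ~ wt, data = dat)
fit_big   = lm(mpg ~ wt + noise, data = dat)

summary(fit_small)$r.squared      # R^2 of the smaller model
summary(fit_big)$r.squared        # at least as large, despite the junk predictor
```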
This suggests that we need a quality criterion that takes into account the size of the model, since our preference is for small models that still fit well. We are willing to sacrifice a small amount of "goodness-of-fit" to obtain a smaller model. (Here we use "goodness-of-fit" simply to mean how far the data is from the model; the smaller the errors, the better. In statistics, goodness-of-fit often has a more precise meaning.) We will look at three criteria that do this explicitly: \(\text{AIC}\), \(\text{BIC}\), and Adjusted \(R^2\). We will also look at one, Cross-Validated \(\text{RMSE}\), which implicitly considers the size of the model.
The first criterion we will discuss is the Akaike Information Criterion, or \(\text{AIC}\) for short. (When Akaike first introduced this metric, it was simply called An Information Criterion; the A has changed meaning over the years.)
Recall, the maximized log-likelihood of a regression model can be written as
\[ \log L(\boldsymbol{\hat{\beta}}, \hat{\sigma}^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\left(\frac{\text{RSS}}{n}\right) - \frac{n}{2}, \]
where \(\text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i) ^ 2\) and \(\boldsymbol{\hat{\beta}}\) and \(\hat{\sigma}^2\) were chosen to maximize the likelihood.
Then we can define \(\text{AIC}\) as
\[ \text{AIC} = -2 \log L(\boldsymbol{\hat{\beta}}, \hat{\sigma}^2) + 2p = n + n \log(2\pi) + n \log\left(\frac{\text{RSS}}{n}\right) + 2p, \]
which is a measure of quality of the model. The smaller the \(\text{AIC}\), the better. To see why, let's talk about the two main components of \(\text{AIC}\), the likelihood (which measures "goodness-of-fit") and the penalty (which is a function of the size of the model).
The likelihood portion of \(\text{AIC}\) is given by
\[ -2 \log L(\boldsymbol{\hat{\beta}}, \hat{\sigma}^2) = n + n \log(2\pi) + n \log\left(\frac{\text{RSS}}{n}\right). \]
For the sake of comparing models, the only term here that will change is \(n \log\left(\frac{\text{RSS}}{n}\right)\), which is a function of \(\text{RSS}\). The
\[ n + n \log(2\pi) \]
terms will be constant across all models applied to the same data. So, when a model fits well, that is, has a low \(\text{RSS}\), then this likelihood component will be small.
Similarly, we can discuss the penalty component of \(\text{AIC}\) which is,
\[ 2p, \]
where \(p\) is the number of \(\beta\) parameters in the model. We call this a penalty, because it is large when \(p\) is large, but we are seeking to find a small \(\text{AIC}\).
Thus, a good model, that is, one with a small \(\text{AIC}\), will balance fitting well against using a small number of parameters. For comparing models
\[ \text{AIC} = n\log\left(\frac{\text{RSS}}{n}\right) + 2p \]
is a sufficient expression, as \(n + n \log(2\pi)\) is the same across all models for any particular dataset.
The Bayesian Information Criterion, or \(\text{BIC}\), is similar to \(\text{AIC}\), but has a larger penalty. \(\text{BIC}\) also quantifies the trade-off between a model which fits well and the number of model parameters; however, for a reasonable sample size, it generally picks a smaller model than \(\text{AIC}\). Again, for model selection, use the model with the smallest \(\text{BIC}\).
\[ \text{BIC} = -2 \log L(\boldsymbol{\hat{\beta}}, \hat{\sigma}^2) + \log(n) p = n + n\log(2\pi) + n\log\left(\frac{\text{RSS}}{n}\right) + \log(n)p. \]
Notice that the \(\text{AIC}\) penalty was
\[ 2p, \]
whereas for \(\text{BIC}\), the penalty is
\[ \log(n) p. \]
So, for any dataset with \(n \geq 8\) (so that \(\log(n) > 2\)), the \(\text{BIC}\) penalty will be larger than the \(\text{AIC}\) penalty; thus \(\text{BIC}\) will tend to prefer a smaller model.
Note that, sometimes the penalty is considered a general expression of the form
\[ k \cdot p. \]
Then, for \(\text{AIC}\) \(k = 2\), and for \(\text{BIC}\) \(k = \log(n)\).
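As a quick check of this general \(k \cdot p\) form, R's extractAIC() accepts the penalty multiplier k directly. Here is a minimal sketch using the built-in mtcars data (the model itself is purely illustrative):

```r
# Sketch: extractAIC() implements the general k * p penalty; k = 2 gives
# AIC and k = log(n) gives BIC. The mtcars model is purely illustrative.
fit = lm(mpg ~ wt + hp, data = mtcars)
n   = length(resid(fit))
p   = length(coef(fit))            # number of beta parameters
rss = sum(resid(fit) ^ 2)

extractAIC(fit, k = 2)[2]          # AIC: n * log(RSS / n) + 2 * p
n * log(rss / n) + 2 * p           # same value, computed by hand

extractAIC(fit, k = log(n))[2]     # BIC: n * log(RSS / n) + log(n) * p
n * log(rss / n) + log(n) * p      # same value, computed by hand
```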
For comparing models
\[ \text{BIC} = n\log\left(\frac{\text{RSS}}{n}\right) + \log(n)p \]
is again a sufficient expression, as \(n + n \log(2\pi)\) is the same across all models for any particular dataset.
Recall,
\[ R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}. \]
We now define
\[ R_a^2 = 1 - \frac{\text{SSE}/(n-p)}{\text{SST}/(n-1)} = 1 - \left( \frac{n-1}{n-p} \right)(1-R^2) \]
which we call the Adjusted \(R^2\).
Unlike \(R^2\) which can never become smaller with added predictors, Adjusted \(R^2\) effectively penalizes for additional predictors, and can decrease with added predictors. Like \(R^2\), larger is still better.
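As a minimal sketch (again using the built-in mtcars data, purely for illustration), we can compute Adjusted \(R^2\) by hand from the formula above and compare it to what summary() reports:

```r
# Sketch: computing Adjusted R^2 by hand and comparing it to summary().
# The mtcars model is purely illustrative.
fit = lm(mpg ~ wt + hp, data = mtcars)
n   = length(resid(fit))
p   = length(coef(fit))                 # beta parameters, intercept included
r2  = summary(fit)$r.squared

adj_r2 = 1 - (n - 1) / (n - p) * (1 - r2)
adj_r2
summary(fit)$adj.r.squared              # same value
```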
Each of the previous three metrics explicitly used \(p\), the number of parameters, in their calculations. Thus, they all explicitly limit the size of models chosen when used to compare models.
We'll now briefly introduce overfitting and cross-validation.
make_poly_data = function(sample_size = 11) {
  x = seq(0, 10, length.out = sample_size)
  y = 3 + x + 4 * x ^ 2 + rnorm(n = sample_size, mean = 0, sd = 20)
  data.frame(x, y)
}
set.seed(1234)
poly_data = make_poly_data()
Here we have generated data where the mean of \(Y\) is a quadratic function of a single predictor \(x\), specifically,
\[ Y = 3 + x + 4 x ^ 2 + \epsilon. \]
We'll now fit two models to this data, one which has the correct form, quadratic, and one that is large, which includes terms up to and including an eighth degree.
fit_quad = lm(y ~ poly(x, degree = 2), data = poly_data)
fit_big = lm(y ~ poly(x, degree = 8), data = poly_data)
We then plot the data and the results of the two models.
plot(y ~ x, data = poly_data, ylim = c(-100, 400), cex = 2, pch = 20)
xplot = seq(0, 10, by = 0.1)
lines(xplot, predict(fit_quad, newdata = data.frame(x = xplot)),
col = "dodgerblue", lwd = 2, lty = 1)
lines(xplot, predict(fit_big, newdata = data.frame(x = xplot)),
col = "darkorange", lwd = 2, lty = 2)
We can see that the solid blue curve models this data rather nicely. The dashed orange curve fits the points better, making smaller errors, however it is unlikely that it is correctly modeling the true relationship between \(x\) and \(y\). It is fitting the random noise. This is an example of overfitting.
We see that the larger model indeed has a lower \(\text{RMSE}\).
sqrt(mean(resid(fit_quad) ^ 2))
## [1] 17.61812
sqrt(mean(resid(fit_big) ^ 2))
## [1] 10.4197
To correct for this, we will introduce cross-validation. We define the leave-one-out cross-validated RMSE to be
\[ \text{RMSE}_{\text{LOOCV}} = \sqrt{\frac{1}{n} \sum_{i=1}^n e_{[i]}^2}. \]
Here \(e_{[i]}\) is the residual for the \(i\)th observation when that observation is not used to fit the model:
\[ e_{[i]} = y_{i} - \hat{y}_{[i]} \]
That is, the fitted value is calculated as
\[ \hat{y}_{[i]} = \boldsymbol{x}_i ^ \top \hat{\beta}_{[i]} \]
where \(\hat{\beta}_{[i]}\) are the estimated coefficients when the \(i\)th observation is removed from the dataset.
In general, to perform this calculation, we would be required to fit the model \(n\) times, once with each possible observation removed. However, for leave-one-out cross-validation and linear models, the equation can be rewritten as
\[ \text{RMSE}_{\text{LOOCV}} = \sqrt{\frac{1}{n}\sum_{i=1}^n \left(\frac{e_{i}}{1-h_{i}}\right)^2}, \]
where \(h_i\) are the leverages and \(e_i\) are the usual residuals. This is great, because now we can obtain the LOOCV \(\text{RMSE}\) by fitting only one model! In practice, 5- or 10-fold cross-validation is much more popular. For example, in 5-fold cross-validation, the model is fit 5 times, each time leaving out a fifth of the data, then predicting on those held-out values. We'll leave in-depth examination of cross-validation to a machine learning course, and simply use LOOCV here.
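To see that the shortcut really agrees with the definition, here is a sketch on simulated data (the data and variable names are illustrative) comparing the leverage formula against the brute-force computation that fits the model \(n\) times:

```r
# Sketch: verify the leverage shortcut against the brute-force definition
# of LOOCV RMSE. The simulated data below is purely illustrative.
set.seed(1)
sim = data.frame(x = runif(20, 0, 10))
sim$y = 1 + 2 * sim$x + rnorm(20)

fit = lm(y ~ x, data = sim)

# shortcut: a single fit, using ordinary residuals and leverages
rmse_shortcut = sqrt(mean((resid(fit) / (1 - hatvalues(fit))) ^ 2))

# brute force: n fits, each leaving out one observation
e = sapply(seq_len(nrow(sim)), function(i) {
  fit_i = lm(y ~ x, data = sim[-i, ])
  sim$y[i] - predict(fit_i, newdata = sim[i, , drop = FALSE])
})
rmse_brute = sqrt(mean(e ^ 2))

all.equal(rmse_shortcut, rmse_brute)  # TRUE
```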
Let's calculate LOOCV \(\text{RMSE}\) for both models, then discuss why we want to do so. We first write a function which calculates the LOOCV \(\text{RMSE}\) as defined using the shortcut formula for linear models.
calc_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
Then calculate the metric for both models.
calc_loocv_rmse(fit_quad)
## [1] 23.57189
calc_loocv_rmse(fit_big)
## [1] 1334.357
Now we see that the quadratic model has a much smaller LOOCV \(\text{RMSE}\), so we would prefer this quadratic model. This is because the large model has severely over-fit the data. By leaving a single data point out and fitting the large model, the resulting fit is much different than the fit using all of the data. For example, let's leave out the third data point and fit both models, then plot the result.
fit_quad_removed = lm(y ~ poly(x, degree = 2), data = poly_data[-3, ])
fit_big_removed = lm(y ~ poly(x, degree = 8), data = poly_data[-3, ])
plot(y ~ x, data = poly_data, ylim = c(-100, 400), cex = 2, pch = 20)
xplot = seq(0, 10, by = 0.1)
lines(xplot, predict(fit_quad_removed, newdata = data.frame(x = xplot)),
col = "dodgerblue", lwd = 2, lty = 1)
lines(xplot, predict(fit_big_removed, newdata = data.frame(x = xplot)),
col = "darkorange", lwd = 2, lty = 2)
We see that, on average, the solid blue line for the quadratic model has errors similar to before; it has changed very slightly. However, the dashed orange line for the large model has a huge error at the removed point and is much different than the previous fit.
This is the purpose of cross-validation. By assessing how the model fits points that were not used to perform the regression, we get an idea of how well the model will work for future observations. It assesses how well the model works in general, not simply on the observed data.
We've now seen a number of model quality criteria; next we need to address which models to consider. Model selection involves both a quality criterion and a search procedure.
Let's return to the seatpos data from the faraway package.
library(faraway)
hipcenter_mod = lm(hipcenter ~ ., data = seatpos)
coef(hipcenter_mod)
## (Intercept) Age Weight HtShoes Ht Seated
## 436.43212823 0.77571620 0.02631308 -2.69240774 0.60134458 0.53375170
## Arm Thigh Leg
## -1.32806864 -1.14311888 -6.43904627
Now, let's consider only models with first-order terms, thus no interactions and no polynomials. There are eight predictors in this model. So if we consider all possible models, ranging from the model with 0 predictors to the model with all eight, there are
\[ \sum_{k = 0}^{p - 1} {{p - 1} \choose {k}} = 2 ^ {p - 1} = 2 ^ 8 = 256 \]
possible models.
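We can quickly verify this count in R:

```r
# All subsets of 8 candidate predictors, from 0 predictors up to all 8
sum(choose(8, 0:8))  # 256
2 ^ 8                # 256
```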
If we had 10 or more predictors, we would already be considering over 1000 models! For this reason, we often search through possible models in an intelligent way, bypassing some models that are unlikely to be considered good. We will consider three search procedures: backwards, forwards, and stepwise.
Backward selection procedures start with all possible predictors in the model, then consider how deleting a single predictor will affect a chosen metric. Let's try this on the seatpos data. We will use the step() function in R, which by default uses \(\text{AIC}\) as its metric of choice.
hipcenter_mod_back_aic = step(hipcenter_mod, direction = "backward")
## Start: AIC=283.62
## hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh +
## Leg
##
## Df Sum of Sq RSS AIC
## - Ht 1 5.01 41267 281.63
## - Weight 1 8.99 41271 281.63
## - Seated 1 28.64 41290 281.65
## - HtShoes 1 108.43 41370 281.72
## - Arm 1 164.97 41427 281.78
## - Thigh 1 262.76 41525 281.87
## <none> 41262 283.62
## - Age 1 2632.12 43894 283.97
## - Leg 1 2654.85 43917 283.99
##
## Step: AIC=281.63
## hipcenter ~ Age + Weight + HtShoes + Seated + Arm + Thigh + Leg
##
## Df Sum of Sq RSS AIC
## - Weight 1 11.10 41278 279.64
## - Seated 1 30.52 41297 279.66
## - Arm 1 160.50 41427 279.78
## - Thigh 1 269.08 41536 279.88
## - HtShoes 1 971.84 42239 280.51
## <none> 41267 281.63
## - Leg 1 2664.65 43931 282.01
## - Age 1 2808.52 44075 282.13
##
## Step: AIC=279.64
## hipcenter ~ Age + HtShoes + Seated + Arm + Thigh + Leg
##
## Df Sum of Sq RSS AIC
## - Seated 1 35.10 41313 277.67
## - Arm 1 156.47 41434 277.78
## - Thigh 1 285.16 41563 277.90
## - HtShoes 1 975.48 42253 278.53
## <none> 41278 279.64
## - Leg 1 2661.39 43939 280.01
## - Age 1 3011.86 44290 280.31
##
## Step: AIC=277.67
## hipcenter ~ Age + HtShoes + Arm + Thigh + Leg
##
## Df Sum of Sq RSS AIC
## - Arm 1 172.02 41485 275.83
## - Thigh 1 344.61 41658 275.99
## - HtShoes 1 1853.43 43166 277.34
## <none> 41313 277.67
## - Leg 1 2871.07 44184 278.22
## - Age 1 2976.77 44290 278.31
##
## Step: AIC=275.83
## hipcenter ~ Age + HtShoes + Thigh + Leg
##
## Df Sum of Sq RSS AIC
## - Thigh 1 472.8 41958 274.26
## <none> 41485 275.83
## - HtShoes 1 2340.7 43826 275.92
## - Age 1 3501.0 44986 276.91
## - Leg 1 3591.7 45077 276.98
##
## Step: AIC=274.26
## hipcenter ~ Age + HtShoes + Leg
##
## Df Sum of Sq RSS AIC
## <none> 41958 274.26
## - Age 1 3108.8 45067 274.98
## - Leg 1 3476.3 45434 275.28
## - HtShoes 1 4218.6 46176 275.90
We start with the model hipcenter ~ ., which is otherwise known as hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + Leg. R will then repeatedly attempt to delete a predictor until it stops, or reaches the model hipcenter ~ 1, which contains no predictors.
At each "step", R reports the current model, its \(\text{AIC}\), and the possible steps with their \(\text{RSS}\) and more importantly \(\text{AIC}\).
In this example, at the first step, the current model is hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + Leg which has an AIC of 283.62. Note that when R is calculating this value, it is using extractAIC(), which uses the expression
\[ \text{AIC} = n\log\left(\frac{\text{RSS}}{n}\right) + 2p, \]
which we quickly verify.
extractAIC(hipcenter_mod) # returns both p and AIC
## [1] 9.000 283.624
n = length(resid(hipcenter_mod))
(p = length(coef(hipcenter_mod)))
## [1] 9
n * log(mean(resid(hipcenter_mod) ^ 2)) + 2 * p
## [1] 283.624
Returning to the first step, R gives us rows showing the effect of deleting each of the current predictors. The - signs at the beginning of the rows indicate we are considering removing a predictor. There is also a row with <none>, which represents keeping the current model. Notice that this row has the smallest \(\text{RSS}\), as it is the largest model.
We see that every row above <none> has a smaller \(\text{AIC}\) than the row for <none>, with the one at the top, Ht, giving the lowest \(\text{AIC}\). Thus we remove Ht from the model, and continue the process.
Notice, in the second step, we start with the model hipcenter ~ Age + Weight + HtShoes + Seated + Arm + Thigh + Leg and the variable Ht is no longer considered.
We continue the process until we reach the model hipcenter ~ Age + HtShoes + Leg. At this step, the row for <none> tops the list, as removing any additional variable will not improve the \(\text{AIC}\). This is the model which is stored in hipcenter_mod_back_aic.
coef(hipcenter_mod_back_aic)
## (Intercept) Age HtShoes Leg
## 456.2136538 0.5998327 -2.3022555 -6.8297461
We could also search through the possible models in a backwards fashion using \(\text{BIC}\). To do so, we again use the step() function, but now specify k = log(n), where n stores the number of observations in the data.
n = length(resid(hipcenter_mod))
hipcenter_mod_back_bic = step(hipcenter_mod, direction = "backward", k = log(n))
## Start: AIC=298.36
## hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh +
## Leg
##
## Df Sum of Sq RSS AIC
## - Ht 1 5.01 41267 294.73
## - Weight 1 8.99 41271 294.73
## - Seated 1 28.64 41290 294.75
## - HtShoes 1 108.43 41370 294.82
## - Arm 1 164.97 41427 294.88
## - Thigh 1 262.76 41525 294.97
## - Age 1 2632.12 43894 297.07
## - Leg 1 2654.85 43917 297.09
## <none> 41262 298.36
##
## Step: AIC=294.73
## hipcenter ~ Age + Weight + HtShoes + Seated + Arm + Thigh + Leg
##
## Df Sum of Sq RSS AIC
## - Weight 1 11.10 41278 291.10
## - Seated 1 30.52 41297 291.12
## - Arm 1 160.50 41427 291.24
## - Thigh 1 269.08 41536 291.34
## - HtShoes 1 971.84 42239 291.98
## - Leg 1 2664.65 43931 293.47
## - Age 1 2808.52 44075 293.59
## <none> 41267 294.73
##
## Step: AIC=291.1
## hipcenter ~ Age + HtShoes + Seated + Arm + Thigh + Leg
##
## Df Sum of Sq RSS AIC
## - Seated 1 35.10 41313 287.50
## - Arm 1 156.47 41434 287.61
## - Thigh 1 285.16 41563 287.73
## - HtShoes 1 975.48 42253 288.35
## - Leg 1 2661.39 43939 289.84
## - Age 1 3011.86 44290 290.14
## <none> 41278 291.10
##
## Step: AIC=287.5
## hipcenter ~ Age + HtShoes + Arm + Thigh + Leg
##
## Df Sum of Sq RSS AIC
## - Arm 1 172.02 41485 284.02
## - Thigh 1 344.61 41658 284.18
## - HtShoes 1 1853.43 43166 285.53
## - Leg 1 2871.07 44184 286.41
## - Age 1 2976.77 44290 286.50
## <none> 41313 287.50
##
## Step: AIC=284.02
## hipcenter ~ Age + HtShoes + Thigh + Leg
##
## Df Sum of Sq RSS AIC
## - Thigh 1 472.8 41958 280.81
## - HtShoes 1 2340.7 43826 282.46
## - Age 1 3501.0 44986 283.46
## - Leg 1 3591.7 45077 283.54
## <none> 41485 284.02
##
## Step: AIC=280.81
## hipcenter ~ Age + HtShoes + Leg
##
## Df Sum of Sq RSS AIC
## - Age 1 3108.8 45067 279.89
## - Leg 1 3476.3 45434 280.20
## <none> 41958 280.81
## - HtShoes 1 4218.6 46176 280.81
##
## Step: AIC=279.89
## hipcenter ~ HtShoes + Leg
##
## Df Sum of Sq RSS AIC
## - Leg 1 3038.8 48105 278.73
## <none> 45067 279.89
## - HtShoes 1 5004.4 50071 280.25
##
## Step: AIC=278.73
## hipcenter ~ HtShoes
##
## Df Sum of Sq RSS AIC
## <none> 48105 278.73
## - HtShoes 1 83534 131639 313.35
The procedure is exactly the same, except at each step we look to improve the \(\text{BIC}\), which R still labels \(\text{AIC}\) in the output.
The variable hipcenter_mod_back_bic stores the model chosen by this procedure.
coef(hipcenter_mod_back_bic)
## (Intercept) HtShoes
## 565.592659 -4.262091
We note that this model is smaller (it has fewer predictors) than the model chosen by \(\text{AIC}\), which is what we would expect. Also note that while the two models are different, neither uses both Ht and HtShoes, which are extremely correlated.
We can use information from the summary() function to compare their Adjusted \(R^2\) values. Note that either selected model performs better than the original full model.
summary(hipcenter_mod)$adj.r.squared
## [1] 0.6000855
summary(hipcenter_mod_back_aic)$adj.r.squared
## [1] 0.6531427
summary(hipcenter_mod_back_bic)$adj.r.squared
## [1] 0.6244149
We can also calculate the LOOCV \(\text{RMSE}\) for both selected models, as well as the full model.
calc_loocv_rmse(hipcenter_mod)
## [1] 44.44564
calc_loocv_rmse(hipcenter_mod_back_aic)
## [1] 37.58473
calc_loocv_rmse(hipcenter_mod_back_bic)
## [1] 37.40564
We see that we would prefer the model chosen via \(\text{BIC}\) if using LOOCV \(\text{RMSE}\) as our metric.
Forward selection is the exact opposite of backwards selection. Here we tell R to start with a model using no predictors, that is hipcenter ~ 1, then at each step R will attempt to add a predictor until it finds a good model or reaches hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + Leg.
hipcenter_mod_start = lm(hipcenter ~ 1, data = seatpos)
hipcenter_mod_forw_aic = step(
hipcenter_mod_start,
scope = hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + Leg,
direction = "forward")
## Start: AIC=311.71
## hipcenter ~ 1
##
## Df Sum of Sq RSS AIC
## + Ht 1 84023 47616 275.07
## + HtShoes 1 83534 48105 275.45
## + Leg 1 81568 50071 276.98
## + Seated 1 70392 61247 284.63
## + Weight 1 53975 77664 293.66
## + Thigh 1 46010 85629 297.37
## + Arm 1 45065 86574 297.78
## <none> 131639 311.71
## + Age 1 5541 126098 312.07
##
## Step: AIC=275.07
## hipcenter ~ Ht
##
## Df Sum of Sq RSS AIC
## + Leg 1 2781.10 44835 274.78
## <none> 47616 275.07
## + Age 1 2353.51 45262 275.14
## + Weight 1 195.86 47420 276.91
## + Seated 1 101.56 47514 276.99
## + Arm 1 75.78 47540 277.01
## + HtShoes 1 25.76 47590 277.05
## + Thigh 1 4.63 47611 277.06
##
## Step: AIC=274.78
## hipcenter ~ Ht + Leg
##
## Df Sum of Sq RSS AIC
## + Age 1 2896.60 41938 274.24
## <none> 44835 274.78
## + Arm 1 522.72 44312 276.33
## + Weight 1 445.10 44390 276.40
## + HtShoes 1 34.11 44801 276.75
## + Thigh 1 32.96 44802 276.75
## + Seated 1 1.12 44834 276.78
##
## Step: AIC=274.24
## hipcenter ~ Ht + Leg + Age
##
## Df Sum of Sq RSS AIC
## <none> 41938 274.24
## + Thigh 1 372.71 41565 275.90
## + Arm 1 257.09 41681 276.01
## + Seated 1 121.26 41817 276.13
## + Weight 1 46.83 41891 276.20
## + HtShoes 1 13.38 41925 276.23
Again, by default R uses \(\text{AIC}\) as its quality metric when using the step() function. Also note that the rows now begin with a +, which indicates the addition of a predictor to the current model.
hipcenter_mod_forw_bic = step(
hipcenter_mod_start,
scope = hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + Leg,
direction = "forward", k = log(n))
## Start: AIC=313.35
## hipcenter ~ 1
##
## Df Sum of Sq RSS AIC
## + Ht 1 84023 47616 278.34
## + HtShoes 1 83534 48105 278.73
## + Leg 1 81568 50071 280.25
## + Seated 1 70392 61247 287.91
## + Weight 1 53975 77664 296.93
## + Thigh 1 46010 85629 300.64
## + Arm 1 45065 86574 301.06
## <none> 131639 313.35
## + Age 1 5541 126098 315.35
##
## Step: AIC=278.34
## hipcenter ~ Ht
##
## Df Sum of Sq RSS AIC
## <none> 47616 278.34
## + Leg 1 2781.10 44835 279.69
## + Age 1 2353.51 45262 280.05
## + Weight 1 195.86 47420 281.82
## + Seated 1 101.56 47514 281.90
## + Arm 1 75.78 47540 281.92
## + HtShoes 1 25.76 47590 281.96
## + Thigh 1 4.63 47611 281.98
We can make the same modification as last time to instead use \(\text{BIC}\) with forward selection.
summary(hipcenter_mod)$adj.r.squared
## [1] 0.6000855
summary(hipcenter_mod_forw_aic)$adj.r.squared
## [1] 0.6533055
summary(hipcenter_mod_forw_bic)$adj.r.squared
## [1] 0.6282374
We can compare the two selected models' Adjusted \(R^2\) as well as their LOOCV \(\text{RMSE}\). The results are very similar to those using backwards selection, although the models are not exactly the same.
calc_loocv_rmse(hipcenter_mod)
## [1] 44.44564
calc_loocv_rmse(hipcenter_mod_forw_aic)
## [1] 37.62516
calc_loocv_rmse(hipcenter_mod_forw_bic)
## [1] 37.2511
Stepwise search checks both backward and forward moves at every step. It considers the addition of any variable not currently in the model, as well as the removal of any variable currently in the model.
Here we perform stepwise search using \(\text{AIC}\) as our metric. We start with the model hipcenter ~ 1 and search up to hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + Leg. Notice that at many of the steps, some rows begin with -, while others begin with +.
hipcenter_mod_both_aic = step(
hipcenter_mod_start,
scope = hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + Leg,
direction = "both")
## Start: AIC=311.71
## hipcenter ~ 1
##
## Df Sum of Sq RSS AIC
## + Ht 1 84023 47616 275.07
## + HtShoes 1 83534 48105 275.45
## + Leg 1 81568 50071 276.98
## + Seated 1 70392 61247 284.63
## + Weight 1 53975 77664 293.66
## + Thigh 1 46010 85629 297.37
## + Arm 1 45065 86574 297.78
## <none> 131639 311.71
## + Age 1 5541 126098 312.07
##
## Step: AIC=275.07
## hipcenter ~ Ht
##
## Df Sum of Sq RSS AIC
## + Leg 1 2781 44835 274.78
## <none> 47616 275.07
## + Age 1 2354 45262 275.14
## + Weight 1 196 47420 276.91
## + Seated 1 102 47514 276.99
## + Arm 1 76 47540 277.01
## + HtShoes 1 26 47590 277.05
## + Thigh 1 5 47611 277.06
## - Ht 1 84023 131639 311.71
##
## Step: AIC=274.78
## hipcenter ~ Ht + Leg
##
## Df Sum of Sq RSS AIC
## + Age 1 2896.6 41938 274.24
## <none> 44835 274.78
## - Leg 1 2781.1 47616 275.07
## + Arm 1 522.7 44312 276.33
## + Weight 1 445.1 44390 276.40
## + HtShoes 1 34.1 44801 276.75
## + Thigh 1 33.0 44802 276.75
## + Seated 1 1.1 44834 276.78
## - Ht 1 5236.3 50071 276.98
##
## Step: AIC=274.24
## hipcenter ~ Ht + Leg + Age
##
## Df Sum of Sq RSS AIC
## <none> 41938 274.24
## - Age 1 2896.6 44835 274.78
## - Leg 1 3324.2 45262 275.14
## - Ht 1 4238.3 46176 275.90
## + Thigh 1 372.7 41565 275.90
## + Arm 1 257.1 41681 276.01
## + Seated 1 121.3 41817 276.13
## + Weight 1 46.8 41891 276.20
## + HtShoes 1 13.4 41925 276.23
We could again instead use \(\text{BIC}\) as our metric.
hipcenter_mod_both_bic = step(
hipcenter_mod_start,
scope = hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + Leg,
direction = "both", k = log(n))
## Start: AIC=313.35
## hipcenter ~ 1
##
## Df Sum of Sq RSS AIC
## + Ht 1 84023 47616 278.34
## + HtShoes 1 83534 48105 278.73
## + Leg 1 81568 50071 280.25
## + Seated 1 70392 61247 287.91
## + Weight 1 53975 77664 296.93
## + Thigh 1 46010 85629 300.64
## + Arm 1 45065 86574 301.06
## <none> 131639 313.35
## + Age 1 5541 126098 315.35
##
## Step: AIC=278.34
## hipcenter ~ Ht
##
## Df Sum of Sq RSS AIC
## <none> 47616 278.34
## + Leg 1 2781 44835 279.69
## + Age 1 2354 45262 280.05
## + Weight 1 196 47420 281.82
## + Seated 1 102 47514 281.90
## + Arm 1 76 47540 281.92
## + HtShoes 1 26 47590 281.96
## + Thigh 1 5 47611 281.98
## - Ht 1 84023 131639 313.35
Adjusted \(R^2\) and LOOCV \(\text{RMSE}\) comparisons are similar to backwards and forwards, which is not at all surprising, as some of the models selected are the same as before.
summary(hipcenter_mod)$adj.r.squared
## [1] 0.6000855
summary(hipcenter_mod_both_aic)$adj.r.squared
## [1] 0.6533055
summary(hipcenter_mod_both_bic)$adj.r.squared
## [1] 0.6282374
calc_loocv_rmse(hipcenter_mod)
## [1] 44.44564
calc_loocv_rmse(hipcenter_mod_both_aic)
## [1] 37.62516
calc_loocv_rmse(hipcenter_mod_both_bic)
## [1] 37.2511
Backward, forward, and stepwise search are all useful, but do have an obvious issue. By not checking every possible model, sometimes they will miss the best possible model. With an extremely large number of predictors, sometimes this is necessary since checking every possible model would be rather time consuming, even with current computers.
However, with a reasonably sized dataset, it isn't too difficult to check all possible models. To do so, we will use the regsubsets() function in the R package leaps.
library(leaps)
all_hipcenter_mod = summary(regsubsets(hipcenter ~ ., data = seatpos))
A few points about this line of code. First, note that we immediately use summary() and store those results. That is simply the intended use of regsubsets(). Second, inside of regsubsets() we specify the model hipcenter ~ .. This will be the largest model considered, that is the model using all first-order predictors, and R will check all possible subsets.
We'll now look at the information stored in all_hipcenter_mod.
all_hipcenter_mod$which
## (Intercept) Age Weight HtShoes Ht Seated Arm Thigh Leg
## 1 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## 3 TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## 4 TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
## 5 TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
## 6 TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## 7 TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Using $which gives us the best model, according to \(\text{RSS}\), for a model of each possible size, in this case ranging from one to eight predictors. For example the best model with four predictors (\(p = 5\)) would use Age, HtShoes, Thigh, and Leg.
all_hipcenter_mod$rss
## [1] 47615.79 44834.69 41938.09 41485.01 41313.00 41277.90 41266.80 41261.78
We can obtain the \(\text{RSS}\) for each of these models using $rss. Notice that these are decreasing since the models range from small to large.
Now that we have the \(\text{RSS}\) for each of these models, it is rather easy to obtain \(\text{AIC}\), \(\text{BIC}\), and Adjusted \(R^2\), since they are all functions of \(\text{RSS}\). Also, since we have the model with the best \(\text{RSS}\) for each size, those models will also have the best \(\text{AIC}\), \(\text{BIC}\), and Adjusted \(R^2\) for each size. Then by picking from those, we can find the overall best \(\text{AIC}\), \(\text{BIC}\), and Adjusted \(R^2\).
Conveniently, Adjusted \(R^2\) is automatically calculated.
all_hipcenter_mod$adjr2
## [1] 0.6282374 0.6399496 0.6533055 0.6466586 0.6371276 0.6257403 0.6133690
## [8] 0.6000855
To find which model has the highest Adjusted \(R^2\) we can use the which.max() function.
(best_r2_ind = which.max(all_hipcenter_mod$adjr2))
## [1] 3
We can then extract the predictors of that model.
all_hipcenter_mod$which[best_r2_ind, ]
## (Intercept) Age Weight HtShoes Ht Seated
## TRUE TRUE FALSE FALSE TRUE FALSE
## Arm Thigh Leg
## FALSE FALSE TRUE
We'll now calculate \(\text{AIC}\) and \(\text{BIC}\) for each of the models with the best \(\text{RSS}\). To do so, we will need both \(n\) and the \(p\) for the largest possible model.
p = length(coef(hipcenter_mod))
n = length(resid(hipcenter_mod))
We'll use the form of \(\text{AIC}\) which leaves out the constant term that is equal across all models.
\[ \text{AIC} = n\log\left(\frac{\text{RSS}}{n}\right) + 2p. \]
Since we have the \(\text{RSS}\) of each model stored, this is easy to calculate.
hipcenter_mod_aic = n * log(all_hipcenter_mod$rss / n) + 2 * (2:p)
We can then extract the predictors of the model with the best \(\text{AIC}\).
best_aic_ind = which.min(hipcenter_mod_aic)
all_hipcenter_mod$which[best_aic_ind,]
## (Intercept) Age Weight HtShoes Ht Seated
## TRUE TRUE FALSE FALSE TRUE FALSE
## Arm Thigh Leg
## FALSE FALSE TRUE
Let's fit this model so we can compare to our previously chosen models using \(\text{AIC}\) and search procedures.
hipcenter_mod_best_aic = lm(hipcenter ~ Age + Ht + Leg, data = seatpos)
The extractAIC() function will calculate the \(\text{AIC}\) defined above for a fitted model.
extractAIC(hipcenter_mod_best_aic)
## [1] 4.0000 274.2418
extractAIC(hipcenter_mod_back_aic)
## [1] 4.0000 274.2597
extractAIC(hipcenter_mod_forw_aic)
## [1] 4.0000 274.2418
extractAIC(hipcenter_mod_both_aic)
## [1] 4.0000 274.2418
We see that two of the models chosen by search procedures have the best possible \(\text{AIC}\), as they are the same model. This, however, is not guaranteed in general. Here, the model chosen using backwards selection does not achieve the smallest possible \(\text{AIC}\).
plot(hipcenter_mod_aic ~ I(2:p), ylab = "AIC", xlab = "p, number of parameters",
pch = 20, col = "dodgerblue", type = "b", cex = 2,
main = "AIC vs Model Complexity")
We could easily repeat this process for \(\text{BIC}\).
\[ \text{BIC} = n\log\left(\frac{\text{RSS}}{n}\right) + \log(n)p. \]
hipcenter_mod_bic = n * log(all_hipcenter_mod$rss / n) + log(n) * (2:p)
which.min(hipcenter_mod_bic)
## [1] 1
all_hipcenter_mod$which[1,]
## (Intercept) Age Weight HtShoes Ht Seated
## TRUE FALSE FALSE FALSE TRUE FALSE
## Arm Thigh Leg
## FALSE FALSE FALSE
hipcenter_mod_best_bic = lm(hipcenter ~ Ht, data = seatpos)
extractAIC(hipcenter_mod_best_bic, k = log(n))
## [1] 2.0000 278.3418
extractAIC(hipcenter_mod_back_bic, k = log(n))
## [1] 2.0000 278.7306
extractAIC(hipcenter_mod_forw_bic, k = log(n))
## [1] 2.0000 278.3418
extractAIC(hipcenter_mod_both_bic, k = log(n))
## [1] 2.0000 278.3418
So far we have only allowed first-order terms in our models. Let's return to the autompg dataset to explore higher-order terms.
autompg = read.table(
"http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
quote = "\"",
comment.char = "",
stringsAsFactors = FALSE)
colnames(autompg) = c("mpg", "cyl", "disp", "hp", "wt", "acc",
"year", "origin", "name")
autompg = subset(autompg, autompg$hp != "?")
autompg = subset(autompg, autompg$name != "plymouth reliant")
rownames(autompg) = paste(autompg$cyl, "cylinder", autompg$year, autompg$name)
autompg$hp = as.numeric(autompg$hp)
autompg$domestic = as.numeric(autompg$origin == 1)
autompg = autompg[autompg$cyl != 5,]
autompg = autompg[autompg$cyl != 3,]
autompg$cyl = as.factor(autompg$cyl)
autompg$domestic = as.factor(autompg$domestic)
autompg = subset(autompg, select = c("mpg", "cyl", "disp", "hp",
"wt", "acc", "year", "domestic"))
str(autompg)
## 'data.frame': 383 obs. of 8 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 3 3 3 3 3 3 3 3 3 3 ...
## $ disp : num 307 350 318 304 302 429 454 440 455 390 ...
## $ hp : num 130 165 150 150 140 198 220 215 225 190 ...
## $ wt : num 3504 3693 3436 3433 3449 ...
## $ acc : num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ domestic: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
Recall that we have two factor variables, cyl and domestic. The cyl variable has three levels, while the domestic variable has only two. Thus the cyl variable will be coded using two dummy variables, while the domestic variable will only need one. We will pay attention to this later.
pairs(autompg, col = "dodgerblue")
We'll use the pairs() plot to determine which variables may benefit from a quadratic relationship with the response. We'll also consider all possible two-way interactions. We won't consider any third-order or higher terms; for example, we won't consider the interaction between first-order terms and the added quadratic terms.
So now, we'll fit this rather large model. We'll use a log-transformed response. Notice that log(mpg) ~ . ^ 2 will automatically consider all first-order terms, as well as all two-way interactions. We use I(var_name ^ 2) to add quadratic terms for some variables. This generally works better than using poly() when performing variable selection.
autompg_big_mod = lm(
log(mpg) ~ . ^ 2 + I(disp ^ 2) + I(hp ^ 2) + I(wt ^ 2) + I(acc ^ 2),
data = autompg)
We think it is rather unlikely that we truly need all of these terms. There are quite a few!
length(coef(autompg_big_mod))
## [1] 40
We'll try backwards search with both \(\text{AIC}\) and \(\text{BIC}\) to attempt to find a smaller, more reasonable model.
autompg_mod_back_aic = step(autompg_big_mod, direction = "backward", trace = 0)
Notice that we used trace = 0 in the function call. This suppresses the output for each step, and simply stores the chosen model. This is useful, as this code would otherwise create a large amount of output. If we had viewed the output, which you can try on your own by removing trace = 0, we would see that R only considers the cyl variable as a single variable, despite the fact that it is coded using two dummy variables. So removing cyl would actually remove two parameters from the resulting model.
You should also notice that R respects hierarchy when attempting to remove variables. That is, for example, R will not consider removing hp if hp:disp or I(hp ^ 2) are currently in the model.
We also use \(\text{BIC}\).
n = length(resid(autompg_big_mod))
autompg_mod_back_bic = step(autompg_big_mod, direction = "backward",
k = log(n), trace = 0)
Looking at the coefficients of the two chosen models, we see they are still rather large.
coef(autompg_mod_back_aic)
## (Intercept) cyl6 cyl8 disp hp
## 3.671884e+00 -1.602563e-01 -8.581644e-01 -9.371971e-03 2.293534e-02
## wt acc year domestic1 I(hp^2)
## -3.064497e-04 -1.393888e-01 -1.966361e-03 9.369324e-01 -1.497669e-05
## cyl6:acc cyl8:acc disp:wt disp:year hp:acc
## 7.220298e-03 5.041915e-02 5.797816e-07 9.493770e-05 -5.062295e-04
## hp:year acc:year acc:domestic1 year:domestic1
## -1.838985e-04 2.345625e-03 -2.372468e-02 -7.332725e-03
coef(autompg_mod_back_bic)
## (Intercept) cyl6 cyl8 disp hp
## 4.657847e+00 -1.086165e-01 -7.611631e-01 -1.609316e-03 2.621266e-03
## wt acc year domestic1 cyl6:acc
## -2.635972e-04 -1.670601e-01 -1.045646e-02 3.341579e-01 4.315493e-03
## cyl8:acc disp:wt hp:acc acc:year acc:domestic1
## 4.610095e-02 4.102804e-07 -3.386261e-04 2.500137e-03 -2.193294e-02
However, they are much smaller than the original full model. Also notice that the resulting models respect hierarchy.
length(coef(autompg_big_mod))
## [1] 40
length(coef(autompg_mod_back_aic))
## [1] 19
length(coef(autompg_mod_back_bic))
## [1] 15
Calculating the LOOCV \(\text{RMSE}\) for each, we see that the model chosen using \(\text{BIC}\) performs the best. That means it is both the best model for prediction, since it achieves the lowest LOOCV \(\text{RMSE}\), and the best model for explanation, since it is also the smallest.
calc_loocv_rmse(autompg_big_mod)
## [1] 0.1112024
calc_loocv_rmse(autompg_mod_back_aic)
## [1] 0.1032888
calc_loocv_rmse(autompg_mod_back_bic)
## [1] 0.103134
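The calc_loocv_rmse() helper used above is not defined in this excerpt. A standard implementation (an assumption here, not shown in the original) uses the leverage shortcut, which for linear models computes the leave-one-out cross-validated \(\text{RMSE}\) from a single fit, without refitting the model \(n\) times:

```r
# Assumed definition of the calc_loocv_rmse helper (not shown in this excerpt).
# For a linear model, the leave-one-out residual equals e_i / (1 - h_i),
# where e_i is the ordinary residual and h_i the leverage (hat value),
# so the LOOCV RMSE can be computed without refitting the model n times.
calc_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
```

For example, calc_loocv_rmse(lm(mpg ~ wt, data = mtcars)) returns a single positive number.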
Throughout this chapter, we have attempted to find reasonably "small" models that are good at explaining the relationship between the response and the predictors, and that also have small errors and are thus good for making predictions.
We'll further discuss the model autompg_mod_back_bic to better explain the difference between using models for explaining and predicting. This is the model fit to the autompg data that was chosen using Backwards Search and \(\text{BIC}\), which obtained the lowest LOOCV \(\text{RMSE}\) of the models we considered.
autompg_mod_back_bic
##
## Call:
## lm(formula = log(mpg) ~ cyl + disp + hp + wt + acc + year + domestic +
## cyl:acc + disp:wt + hp:acc + acc:year + acc:domestic, data = autompg)
##
## Coefficients:
## (Intercept) cyl6 cyl8 disp hp
## 4.658e+00 -1.086e-01 -7.612e-01 -1.609e-03 2.621e-03
## wt acc year domestic1 cyl6:acc
## -2.636e-04 -1.671e-01 -1.046e-02 3.342e-01 4.315e-03
## cyl8:acc disp:wt hp:acc acc:year acc:domestic1
## 4.610e-02 4.103e-07 -3.386e-04 2.500e-03 -2.193e-02
Notice this is a somewhat "large" model, which uses 15 parameters, including several interaction terms. Do we care that this is a "large" model? The answer is, it depends.
Suppose we would like to use this model for explanation. Perhaps we are a car manufacturer trying to engineer a fuel efficient vehicle. If this is the case, we are interested both in which predictor variables are useful for explaining the car's fuel efficiency, and in how those variables affect fuel efficiency. By understanding this relationship, we can use this knowledge to our advantage when designing a car.
To explain a relationship, we are interested in keeping models as small as possible, since smaller models are easy to interpret. The fewer predictors, the fewer considerations we need to make in our design process. Also, the fewer interactions and polynomial terms, the easier it is to interpret any one parameter, since the parameter interpretations are conditional on which parameters are in the model.
Note that linear models are rather interpretable to begin with. Later in your data analysis careers, you will see more complicated models that may fit data better, but are much harder, if not impossible, to interpret. These models aren't very useful for explaining a relationship.
To find small and interpretable models, we would use selection criteria that explicitly penalize larger models, such as AIC and BIC. In this case we still obtained a somewhat large model, but one much smaller than the model we used to start the selection process.
A word of caution when using a model to explain a relationship. There are two terms often used to describe a relationship between two variables: causation and correlation. Correlation is often also referred to as association.
Just because two variables are correlated does not necessarily mean that one causes the other. For example, consider modeling mpg as only a function of hp.
plot(mpg ~ hp, data = autompg, col = "dodgerblue", pch = 20, cex = 1.5)
Does an increase in horsepower cause a drop in fuel efficiency? Or, perhaps the causality is reversed and an increase in fuel efficiency causes a decrease in horsepower. Or, perhaps there is a third variable that explains both!
The issue here is that we have observational data. With observational data, we can only detect associations. To speak with confidence about causality, we would need to run experiments.
This is a concept you will encounter often in your statistics education. For some further reading, and some related fallacies, see: Wikipedia: Correlation does not imply causation.
Suppose now instead of the manufacturer who would like to build a car, we are a consumer who wishes to purchase a new car. However this particular car is so new, it has not been rigorously tested, so we are unsure of what fuel efficiency to expect. (And, as skeptics, we don't trust what the manufacturer is telling us.)
In this case, we would like to use the model to help predict the fuel efficiency of this car based on its attributes, which are the predictors of the model. The smaller the errors the model makes, the more confident we are in its predictions. Thus, to find models for prediction, we would use selection criteria that implicitly penalize larger models, such as LOOCV \(\text{RMSE}\). So long as the model does not overfit, we do not actually care how large the model becomes. Explaining the relationship between the variables is not our goal here; we simply want to know what kind of fuel efficiency we should expect!
If we only care about prediction, we don't need to worry about correlation vs causation, and we don't need to worry about model assumptions.
If a variable is correlated with the response, it doesn't actually matter if it causes an effect on the response; it can still be useful for prediction. For example, among elementary-school-aged children, shoe size certainly doesn't cause them to read at a higher level, yet we could very easily use shoe size to make a prediction about a child's reading ability: the larger the shoe size, the better they read. There's a lurking variable here, though: their age! (Don't send your kids to school with size 14 shoes, it won't make them read better!)
We also don't care about model assumptions. Least squares is least squares. For a specified model, it will find the values of the parameters which will minimize the squared error loss. Your results might be largely uninterpretable and useless for inference, but for prediction none of that matters.
We described how methods such as LDA and QDA are not meant to be used with many predictors \(p\) because the number of parameters that we need to estimate becomes too large. For example, with the digits example \(p=784\), we would have over 600,000 parameters with LDA, and we would multiply that by the number of classes for QDA. Kernel methods such as kNN or local regression do not have model parameters to estimate. However, they also face a challenge when multiple predictors are used due to what is referred to as the curse of dimensionality. The dimension here refers to the fact that when we have \(p\) predictors, the distance between two observations is computed in \(p\)-dimensional space.
A useful way of understanding the curse of dimensionality is by considering how large we have to make a span/neighborhood/window to include a given percentage of the data. Remember that with larger neighborhoods, our methods lose flexibility.
For example, suppose we have one continuous predictor with equally spaced points in the [0,1] interval and we want to create windows that include 1/10th of data. Then it's easy to see that our windows have to be of size 0.1:
Now, for two predictors, if we decide to keep the neighborhood just as small, 10% for each dimension, we include only 1 point. If we want to include 10% of the data, then we need to increase the size of each side of the square to \(\sqrt{.10} \approx .316\):
Using the same logic, if we want to include 10% of the data in a three-dimensional space, then the side of each cube is \(\sqrt[3]{.10} \approx 0.464\). In general, to include 10% of the data in a case with \(p\) dimensions, we need an interval with each side of size \(\sqrt[p]{.10}\) of the total. This proportion gets close to 1 quickly, and if the proportion is 1 it means we include all the data and are no longer smoothing.
library(tidyverse)
p <- 1:100
qplot(p, .1^(1/p), ylim = c(0,1))
By the time we reach 100 predictors, the neighborhood is no longer very local, as each side covers almost the entire dataset.
Here we look at a set of elegant and versatile methods that adapt to higher dimensions and also allow these regions to take more complex shapes while still producing models that are interpretable. These are very popular, well-known and studied methods. We will concentrate on regression and decision trees and their extension to random forests.
To motivate this section, we will use a new dataset that includes the breakdown of the composition of olive oil into 8 fatty acids:
library(tidyverse)
library(dslabs)
data("olive")
names(olive)
## [1] "region" "area" "palmitic" "palmitoleic" "stearic"
## [6] "oleic" "linoleic" "linolenic" "arachidic" "eicosenoic"
For illustrative purposes, we will try to predict the region using the fatty acid composition values as predictors.
table(olive$region)
##
## Northern Italy Sardinia Southern Italy
## 151 98 323
We remove the area column because we won't use it as a predictor.
olive <- select(olive, -area)
Let's very quickly try to predict the region using kNN:
library(caret)
fit <- train(region ~ ., method = "knn",
tuneGrid = data.frame(k = seq(1, 15, 2)),
data = olive)
ggplot(fit)
We see that using just one neighbor, we can predict relatively well. However, a bit of data exploration reveals that we should be able to do even better. For example, if we look at the distribution of each predictor stratified by region we see that eicosenoic is only present in Southern Italy and that linoleic separates Northern Italy from Sardinia.
olive %>% gather(fatty_acid, percentage, -region) %>%
ggplot(aes(region, percentage, fill = region)) +
geom_boxplot() +
facet_wrap(~fatty_acid, scales = "free", ncol = 4) +
theme(axis.text.x = element_blank(), legend.position="bottom")
This implies that we should be able to build an algorithm that predicts perfectly! We can see this clearly by plotting the values for eicosenoic and linoleic.
olive %>%
ggplot(aes(eicosenoic, linoleic, color = region)) +
geom_point() +
geom_vline(xintercept = 0.065, lty = 2) +
geom_segment(x = -0.2, y = 10.54, xend = 0.065, yend = 10.54,
color = "black", lty = 2)
In Section @ref(predictor-space) we define predictor spaces. The predictor space here consists of eight-dimensional points with values between 0 and 100. In the plot above, we show the space defined by the two predictors eicosenoic and linoleic, and, by eye, we can construct a prediction rule that partitions the predictor space so that each partition contains only outcomes of a single category. This in turn can be used to define an algorithm with perfect accuracy. Specifically, we define the following decision rule. If eicosenoic is larger than 0.065, predict Southern Italy. If not, then if linoleic is larger than \(10.535\), predict Sardinia, and if lower, predict Northern Italy. We can draw this decision tree like this:
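As a sanity check, the hand-built rule can be coded directly with nested ifelse calls (the cutoffs 0.065 and 10.535 are the ones read off the plot above):

```r
library(dslabs)
data("olive")

# Apply the two-split decision rule by hand:
# eicosenoic > 0.065 -> Southern Italy; otherwise split on linoleic
pred <- ifelse(olive$eicosenoic > 0.065, "Southern Italy",
        ifelse(olive$linoleic > 10.535, "Sardinia", "Northern Italy"))

# Proportion of correct predictions
mean(pred == olive$region)
```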
Decision trees like this are often used in practice. For example, to decide on a person's risk of poor outcome after having a heart attack, doctors use the following:
(Source: Walton 2010, Informal Logic, Vol. 30, No. 2, pp. 159-184.)
A tree is basically a flow chart of yes or no questions. The general idea of the methods we are describing is to define an algorithm that uses data to create these trees with predictions at the ends, referred to as nodes. Regression and decision trees operate by predicting an outcome variable \(Y\) by partitioning the predictors.
When the outcome is continuous, we call the method a regression tree. To introduce regression trees, we will use the 2008 poll data used in previous sections to describe the basic idea of how we build these algorithms. As with other machine learning algorithms, we will try to estimate the conditional expectation \(f(x) = \mbox{E}(Y | X = x)\) with \(Y\) the poll margin and \(x\) the day.
data("polls_2008")
qplot(day, margin, data = polls_2008)
The general idea here is to build a decision tree and, at the end of each node, obtain a predictor \(\hat{y}\). A mathematical way to describe this is to say that we are partitioning the predictor space into \(J\) non-overlapping regions, \(R_1, R_2, \ldots, R_J\), and then for any predictor \(x\) that falls within region \(R_j\), estimate \(f(x)\) with the average of the training observations \(y_i\) for which the associated predictor \(x_i\) is also in \(R_j\).
But how do we decide on the partition \(R_1, R_2, \ldots, R_J\) and how do we choose \(J\)? Here is where the algorithm gets a bit complicated.
Regression trees create partitions recursively. We start the algorithm with one partition, the entire predictor space. In our simple first example, this space is the interval [-155, 1]. But after the first step we will have two partitions. After the second step we will split one of these partitions into two and will have three partitions, then four, then five, and so on. We describe how we pick the partition to further partition, and when to stop, later.
Once we select a partition \(\mathbf{x}\) to split in order to create the new partitions, we find a predictor \(j\) and value \(s\) that define two new partitions, which we will call \(R_1(j,s)\) and \(R_2(j,s)\), that split our observations in the current partition by asking if \(x_j\) is bigger than \(s\):
\[ R_1(j,s) = \{\mathbf{x} \mid x_j < s\} \mbox{ and } R_2(j,s) = \{\mathbf{x} \mid x_j \geq s\} \]
In our current example we only have one predictor, so we will always choose \(j=1\), but in general this will not be the case. Now, after we define the new partitions \(R_1\) and \(R_2\), and we decide to stop the partitioning, we compute predictors by taking the average of all the observations \(y\) for which the associated \(\mathbf{x}\) is in \(R_1\) and \(R_2\). We refer to these two as \(\hat{y}_{R_1}\) and \(\hat{y}_{R_2}\) respectively.
But how do we pick \(j\) and \(s\)? Basically we find the pair that minimizes the residual sum of squares (RSS): \[ \sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2 \]
This is then applied recursively to the new regions \(R_1\) and \(R_2\). We describe how we stop later, but once we are done partitioning the predictor space into regions, in each region a prediction is made using the observations in that region.
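To make the recipe concrete, here is a minimal sketch (an illustration, not the actual rpart implementation) of the exhaustive search for the best cutpoint \(s\) when there is a single predictor:

```r
# Sketch of the one-predictor split search: evaluate the RSS of every
# candidate cutpoint (midpoints between adjacent unique x values) and
# keep the one minimizing the total RSS of the two partitions.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # candidate values of s
  rss <- sapply(cuts, function(s) {
    left  <- y[x <  s]   # observations in R_1(j, s)
    right <- y[x >= s]   # observations in R_2(j, s)
    sum((left - mean(left)) ^ 2) + sum((right - mean(right)) ^ 2)
  })
  cuts[which.min(rss)]
}

best_split(c(1, 2, 3, 10, 11, 12), c(0, 0, 0, 5, 5, 5))  # 6.5
```

The returned cutpoint 6.5 separates the two clusters perfectly, giving zero RSS in each partition.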
Let's take a look at what this algorithm does on the 2008 presidential election poll data. We will use the rpart function in the rpart package.
library(rpart)
fit <- rpart(margin ~ ., data = polls_2008)
Here, there is only one predictor. Thus we do not have to decide which predictor \(j\) to split by, we simply have to decide what value \(s\) we use to split. We can visually see where the splits were made:
plot(fit, margin = 0.1)
text(fit, cex = 0.75)
The first split is made on day 39.5. One of those regions is then split at day 86.5. The two resulting new partitions are split on days 49.5 and 117.5, respectively, and so on. We end up with 8 partitions. The final estimate \(\hat{f}(x)\) looks like this:
polls_2008 %>%
mutate(y_hat = predict(fit)) %>%
ggplot() +
geom_point(aes(day, margin)) +
geom_step(aes(day, y_hat), col="red")
Note that the algorithm stopped at 8 partitions. Now we explain how this decision is made.
First we need to define the term complexity parameter (cp). Every time we split and define two new partitions, our training set RSS decreases. This is because with more partitions, our model has more flexibility to adapt to the training data. In fact, if you split until every point is its own partition, then RSS goes all the way down to 0 since the average of one value is that same value. To avoid this, the algorithm sets a minimum for how much the RSS must improve for another partition to be added. This parameter is referred to as the complexity parameter (cp). The RSS must improve by a factor of cp for the new partition to be added. Large values of cp will therefore force the algorithm to stop earlier which results in fewer nodes.
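To see the cp values rpart actually used at each split, we can inspect the complexity parameter table with printcp() (here refitting the same tree as above for completeness):

```r
library(rpart)
library(dslabs)
data("polls_2008")

# Same fit as above; printcp() shows one row per split, with the
# relative training error and cross-validated error at each cp value
fit <- rpart(margin ~ ., data = polls_2008)
printcp(fit)
```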
However, cp is not the only parameter used to decide if we should partition a current partition or not. Another common parameter is the minimum number of observations required in a partition before partitioning it further. The argument used in the rpart function is minsplit and the default is 20. The rpart implementation of regression trees also permits users to determine a minimum number of observations in each node. The argument is minbucket and defaults to round(minsplit/3).
As expected, if we set cp = 0 and minsplit = 2, then our prediction is as flexible as possible and our predictor is our original data:
fit <- rpart(margin ~ ., data = polls_2008,
control = rpart.control(cp = 0, minsplit = 2))
polls_2008 %>%
mutate(y_hat = predict(fit)) %>%
ggplot() +
geom_point(aes(day, margin)) +
geom_step(aes(day, y_hat), col="red")
Intuitively we know that this is not a good approach as it will generally result in over-training. These three parameters, cp, minsplit, and minbucket, can be used to control the variability of the final predictors. The larger these values are, the more data is averaged to compute each predictor, thus reducing variability. The drawback is that this restricts flexibility.
So how do we pick these parameters? We can use cross validation, described in Chapter @ref(cross-validation), just like with any tuning parameter. Here is an example of using cross validation to choose cp.
library(caret)
train_rpart <- train(margin ~ .,
method = "rpart",
tuneGrid = data.frame(cp = seq(0, 0.05, len = 25)),
data = polls_2008)
ggplot(train_rpart)
To see the resulting tree, we access the finalModel and plot it:
plot(train_rpart$finalModel, margin = 0.1)
text(train_rpart$finalModel, cex = 0.75)
And because we only have one predictor, we can actually plot \(\hat{f}(x)\):
polls_2008 %>%
mutate(y_hat = predict(train_rpart)) %>%
ggplot() +
geom_point(aes(day, margin)) +
geom_step(aes(day, y_hat), col="red")
Note that if we already have a tree and want to apply a higher cp value, we can use the prune function. We call this pruning a tree because we are snipping off partitions that do not meet a cp criterion. We previously created a tree that used a cp = 0 and saved it to fit. We can prune it like this:
pruned_fit <- prune(fit, cp = 0.01)
Classification trees, or decision trees, are used in prediction problems where the outcome is categorical. We use the same partitioning principle with some differences to account for the fact that we are now working with a categorical outcome.
The first difference is that we form predictions by calculating which class is the most common among the training set observations within the partition, rather than taking the average in each partition (as we can't take the average of categories).
The second is that we can no longer use RSS to choose the partition. While we could use the naive approach of looking for partitions that minimize training error, better performing approaches use more sophisticated metrics. Two of the more popular ones are the Gini Index and Entropy.
In a perfect scenario, the outcomes in each of our partitions are all of the same category since this will permit perfect accuracy. The Gini Index is going to be 0 in this scenario, and become larger the more we deviate from this scenario. To define the Gini Index, we define \(\hat{p}_{j,k}\) as the proportion of observations in partition \(j\) that are of class \(k\). The Gini Index is defined as
\[ \mbox{Gini}(j) = \sum_{k=1}^K \hat{p}_{j,k}(1-\hat{p}_{j,k}) \]
If you study the formula carefully you will see that it is in fact 0 in the perfect scenario described above.
Entropy is a very similar quantity, defined as
\[ \mbox{entropy}(j) = -\sum_{k=1}^K \hat{p}_{j,k}\log(\hat{p}_{j,k}), \mbox{ with } 0 \times \log(0) \mbox{ defined as }0 \]
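As a quick sketch, both impurity measures are easy to compute from the class proportions of a partition:

```r
# Gini index and entropy for a vector p of class proportions in one partition
gini    <- function(p) sum(p * (1 - p))
entropy <- function(p) -sum(ifelse(p == 0, 0, p * log(p)))

gini(c(0.7, 0.2, 0.1))   # mixed partition: 0.46
gini(c(1, 0, 0))         # pure partition: 0
entropy(c(1, 0, 0))      # also 0, using the convention 0 * log(0) = 0
```

Both measures are 0 when a partition is pure and grow as the class proportions become more evenly mixed.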
Let us look at how a classification tree performs on the digits example we examined before:
We can use this code to run the algorithm and plot the resulting tree:
train_rpart <- train(y ~ .,
method = "rpart",
tuneGrid = data.frame(cp = seq(0.0, 0.1, len = 25)),
data = mnist_27$train)
plot(train_rpart)
The accuracy achieved by this approach is better than what we got with regression, but is not as good as what we achieved with kernel methods:
y_hat <- predict(train_rpart, mnist_27$test)
confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"]
## Accuracy
## 0.82
The plot of the estimated conditional probability shows us the limitations of classification trees:
Note that with decision trees, it is difficult to make the boundaries smooth since each partition creates a discontinuity.
Classification trees have certain advantages that make them very useful. They are highly interpretable, even more so than linear models. They are easy to visualize (if small enough). Finally, they can model human decision processes and don't require use of dummy predictors for categorical variables. On the other hand, the approach via recursive partitioning can easily over-train and is therefore a bit harder to train than, for example, linear regression or kNN. Furthermore, in terms of accuracy, it is rarely the best performing method since it is not very flexible and is highly unstable to changes in training data. Random forests, explained next, improve on several of these shortcomings.
Random forests are a very popular machine learning approach that addresses the shortcomings of decision trees using a clever idea. The goal is to improve prediction performance and reduce instability by averaging multiple decision trees (a forest of trees constructed with randomness). It has two features that help accomplish this.
The first step is bootstrap aggregation, or bagging. The general idea is to generate many predictors, each using regression or classification trees, and then form a final prediction based on the average prediction of all these trees. To assure that the individual trees are not the same, we use the bootstrap to induce randomness. These two features combined explain the name: the bootstrap makes the individual trees randomly different, and the combination of trees is the forest. The specific steps are as follows.
1. Build \(B\) decision trees using the training set. We refer to the fitted models as \(T_1, T_2, \dots, T_B\). We later explain how we ensure they are different.
2. For every observation in the test set, form a prediction \(\hat{y}_j\) using tree \(T_j\).
3. For continuous outcomes, form a final prediction with the average \(\hat{y} = \frac{1}{B} \sum_{j=1}^B \hat{y}_j\). For categorical data classification, predict \(\hat{y}\) with majority vote (most frequent class among \(\hat{y}_1, \dots, \hat{y}_B\)).
So how do we get different decision trees from a single training set? For this, we use randomness in two ways which we explain in the steps below. Let \(N\) be the number of observations in the training set. To create \(T_j, \, j=1,\ldots,B\) from the training set we do the following:
1. Create a bootstrap training set by sampling \(N\) observations from the training set with replacement. This is the first way to induce randomness.
2. A large number of features is typical in machine learning challenges. Often, many features can be informative but including them all in the model may result in overfitting. The second way random forests induce randomness is by randomly selecting features to be included in the building of each tree. A different random subset is selected for each tree. This reduces correlation between trees in the forest, thereby improving prediction accuracy.
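The bootstrap part of these steps can be sketched by bagging rpart trees by hand (randomForest does this, plus the random feature selection, internally; with a single predictor there is no feature subsampling, so only the bootstrap induces randomness here):

```r
library(rpart)
library(dslabs)
data("polls_2008")

# Bagging sketch: fit B trees, each on a bootstrap resample of the
# training set, then average their predictions (step 3 above)
set.seed(1)
B <- 25
preds <- sapply(1:B, function(b) {
  idx <- sample(nrow(polls_2008), replace = TRUE)   # bootstrap sample
  tree <- rpart(margin ~ day, data = polls_2008[idx, ])
  predict(tree, newdata = polls_2008)               # predictions from T_b
})
y_hat <- rowMeans(preds)   # final bagged prediction for each day
```

Averaging the B step functions already produces a smoother estimate than any single tree.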
To illustrate how the first steps can result in smoother estimates we will demonstrate by fitting a random forest to the 2008 polls data. We will use the randomForest function in the randomForest package:
library(randomForest)
fit <- randomForest(margin~., data = polls_2008)
Note that if we apply the function plot to the resulting object, stored in fit, we see how the error rate of our algorithm changes as we add trees.
rafalib::mypar()
plot(fit)
We can see that in this case, the error decreases as we add more trees until about 30 trees, where it stabilizes.
The resulting estimate for this random forest can be seen like this:
polls_2008 %>%
mutate(y_hat = predict(fit, newdata = polls_2008)) %>%
ggplot() +
geom_point(aes(day, margin)) +
geom_line(aes(day, y_hat), col="red")
Notice that the random forest estimate is much smoother than what we achieved with the regression tree in the previous section. This is possible because the average of many step functions can be smooth. We can see this by visually examining how the estimate changes as we add more trees. In the following figure you see each of the bootstrap samples for several values of \(b\) and for each one we see the tree that is fitted in grey, the previous trees that were fitted in lighter grey, and the result of averaging all the trees estimated up to that point.
Here is the random forest fit for our digits example based on two predictors:
library(randomForest)
train_rf <- randomForest(y ~ ., data=mnist_27$train)
confusionMatrix(predict(train_rf, mnist_27$test),
mnist_27$test$y)$overall["Accuracy"]
## Accuracy
## 0.79
Here is what the conditional probabilities look like:
Visualizing the estimate shows that, although we obtain high accuracy, it appears that there is room for improvement by making the estimate smoother. This could be achieved by changing the parameter that controls the minimum number of data points in the nodes of the tree. The larger this minimum, the smoother the final estimate will be. We can train the parameters of the random forest. Below, we use the caret package to optimize over the minimum node size. Because this is not one of the parameters that the caret package optimizes by default, we will write our own code:
nodesize <- seq(1, 51, 10)
acc <- sapply(nodesize, function(ns){
train(y ~ ., method = "rf", data = mnist_27$train,
tuneGrid = data.frame(mtry = 2),
nodesize = ns)$results$Accuracy
})
qplot(nodesize, acc)
We can now fit the random forest with the optimized minimum node size to the entire training data and evaluate performance on the test data.
train_rf_2 <- randomForest(y ~ ., data=mnist_27$train,
nodesize = nodesize[which.max(acc)])
confusionMatrix(predict(train_rf_2, mnist_27$test),
mnist_27$test$y)$overall["Accuracy"]
## Accuracy
## 0.815
The selected model improves accuracy and provides a smoother estimate.
Note that we can avoid writing our own code by using other random forest implementations as described in the caret manual.
Random forests perform better in all the examples we have considered. However, a disadvantage of random forests is that we lose interpretability. An approach that helps with interpretability is to examine variable importance: to define variable importance, we count how often a predictor is used in the individual trees. You can learn more about variable importance in an advanced machine learning book. The caret package includes the function varImp, which extracts variable importance from any model for which the calculation is implemented. We give an example of how we use variable importance in the next section.
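As a concrete illustration (a sketch using the built-in iris data rather than this chapter's objects), the randomForest package reports importance as the mean decrease in Gini impurity contributed by each predictor across all trees:

```r
library(randomForest)
set.seed(1)
fit <- randomForest(Species ~ ., data = iris)
imp <- importance(fit)   # MeanDecreaseGini, one row per predictor
imp
```

Predictors with larger values were more useful for splitting; caret's varImp presents the same kind of information on a standardized scale.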
1. Create a simple dataset where the outcome grows 0.75 units on average for every one-unit increase in a predictor:
n <- 1000
sigma <- 0.25
x <- rnorm(n, 0, 1)
y <- 0.75 * x + rnorm(n, 0, sigma)
dat <- data.frame(x = x, y = y)
Use rpart to fit a regression tree and save the result to fit.
2. Plot the final tree so that you can see where the partitions occurred.
3. Make a scatterplot of y versus x along with the predicted values based on the fit.
4. Now model with a random forest instead of a regression tree using randomForest from the randomForest package, and remake the scatterplot with the prediction line.
5. Use the function plot to see if the random forest has converged or if we need more trees.
6. It seems that the default values for the random forest result in an estimate that is too flexible (not smooth). Re-run the random forest but this time with nodesize set at 50 and maxnodes set at 25. Remake the plot.
7. We see that this yields smoother results. Let's use the train function to help us pick these values. From the caret manual we see that we can't tune the maxnodes parameter or the nodesize argument with randomForest, so we will use the Rborist package and tune the minNode argument. Use the train function to try values minNode <- seq(5, 250, 25). See which value minimizes the estimated RMSE.
8. Make a scatterplot along with the prediction from the best fitted model.
9. Use the rpart function to fit a classification tree to the tissue_gene_expression dataset. Use the train function to estimate the accuracy. Try out cp values of seq(0, 0.05, 0.01). Plot the accuracy to report the results of the best model.
10. Study the confusion matrix for the best fitting classification tree. What do you observe happening for placenta?
11. Notice that placentas are called endometrium more often than placenta. Note also that the number of placentas is just six, and that, by default, rpart requires 20 observations before splitting a node. Thus it is not possible with these parameters to have a node in which placentas are the majority. Rerun the above analysis but this time permit rpart to split any node by using the argument control = rpart.control(minsplit = 0). Does the accuracy increase? Look at the confusion matrix again.
12. Plot the tree from the best fitting model obtained in exercise 11.
13. We can see that with just six genes, we are able to predict the tissue type. Now let's see if we can do even better with a random forest. Use the train function and the rf method to train a random forest. Try out mtry values of at least seq(50, 200, 25). What mtry value maximizes accuracy? To permit small nodes, as we did with the classification trees, use the argument nodesize = 1. This will take several seconds to run. If you want to test it out first, try smaller values of ntree. Set the seed to 1990.
14. Use the function varImp on the output of train and save it to an object called imp.
15. The rpart model we ran above produced a tree that used just six predictors. Extracting the predictor names is not straightforward, but can be done. If the output of the call to train was fit_rpart, we can extract the names like this:
ind <- !(fit_rpart$finalModel$frame$var == "<leaf>")
tree_terms <- fit_rpart$finalModel$frame$var[ind] %>%
  unique() %>%
  as.character()
tree_terms
What is the variable importance in the random forest call for these predictors? Where do they rank?
16. Advanced: Extract the top 50 predictors based on importance, take a subset of x with just these predictors and apply the function heatmap to see how these genes behave across the tissues. We will introduce the heatmap function in Chapter @ref(clustering).
Acknowledgments: All the material for this lab was taken from the Data Science Book by Rafael Irizarry.